# A brief note on regression.

Date modified: Mon 2023-09-18 . 09:01 AM
Date created: Sat 2023-10-07 . 09:02 PM

This week we have been looking at some data, collecting them, plotting them, and finding "best fit lines" or "best fit curves". I'd like to give you a brief summary of all of this.

Regression is the study of dependence. For example:

- How does the blood alcohol content (BAC) affect the relative risk of getting in an accident while driving?
- How does the frequency change as one goes up and down on adjacent keys on the piano?
- How does the mean distance to the sun of a planet in our solar system affect its orbital period ("1-year")?
- Can we predict the population of Chad in the near future?
- Etc.

In general we want to know how an observation / response variable $Y$ depends on a predictor variable $X$.

## Visual representation of data with scatter plot.

As a first step to deduce such a dependence of $Y$ on $X$, we gather a collection of $N$ data points in the form $(x_{i}, y_{i})$ for $i=1,2,\ldots,N$. We plot them in a **scatter plot** to get a visual representation of the data.

For example, the [[1 teaching/summer program 2023/puzzles-and-problems/keplers-third-law-and-power-law|data]] Kepler used (from Tycho Brahe) on the mean distance to the sun of the planets (in AU) and their orbital period (in days) has the corresponding scatter plot shown here (using only the six planets known to Kepler: Mercury, Venus, Earth, Mars, Jupiter, and Saturn):

![[1 teaching/summer program 2023/puzzles-and-problems/---files/Pasted image 20230902141909.png]]

A simple regression model is to guess a **linear** relation, $Y \sim aX+b$, between the response variable $Y$ and the predictor variable $X$. Using the **method of least-squares** one can (or a computer can) find the parameters (slope $a$ and intercept $b$) of this line $aX+b$.
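
To make the least-squares step less of a black box, here is a minimal Python sketch that computes the slope and intercept with the standard closed-form formulas (slope = covariance of $X$ and $Y$ divided by variance of $X$). The helper name `fit_line` is made up for this illustration, and the planet values are rounded modern figures, which is an assumption: they are not the Tycho Brahe data used in this note, so the fitted numbers will differ a little from the ones quoted in the text.

```python
# ASSUMPTION: rounded modern values (mean distance in AU, orbital period in days).
# These are NOT the Tycho Brahe / Kepler values used in the note.
planets = {
    "Mercury": (0.387, 87.97),
    "Venus": (0.723, 224.70),
    "Earth": (1.000, 365.26),
    "Mars": (1.524, 686.98),
    "Jupiter": (5.203, 4332.59),
    "Saturn": (9.537, 10759.22),
}


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The least-squares line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


distances = [d for d, _ in planets.values()]
periods = [p for _, p in planets.values()]

a, b = fit_line(distances, periods)
print(f"Y ~ {a:.2f} X + {b:.2f}")
```

A tool such as Desmos does the same kind of computation for you; the sketch is only meant to show that "best fit" has a concrete recipe behind it.
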
Using the $Y\sim aX+b$ command in Desmos gives the following line, where $a=1164.61$ and $b=-818.423$:

![[1 teaching/summer program 2023/puzzles-and-problems/---files/Pasted image 20230902142420.png]]

Ok, but maybe a couple of questions arise:

1. Why do we even care for this model?
2. How good is this linear model?

## Why do we care for these models?

Let us touch on why we care for these models:

> **We can use the model to predict / extrapolate** information that we don't yet have data for!

For example, here we have the linear model
$$
Y = 1164.61 X -818.423
$$
where $Y$ is the orbital period in days and $X$ is the mean distance to the sun in AU. (By the way, 1 AU, one astronomical unit, is the mean distance between the sun and the earth.)

A planet unknown to Kepler is Neptune (discovered in 1846). Neptune has a mean distance to the sun of about 30 AU. So using this model we can predict Neptune's orbital period by setting $X= 30$, which gives
$$
1164.61(30) - 818.423 \approx 34119.88
$$
Namely, about $34120$ days. Neat!

But wait. Is this actually a good model? Modern data gives Neptune's orbital period as about 165 years, which is about 60225 days. We are quite off!

This leads to our next point: How can we roughly tell if our model is good or bad?

## Residual plots.

A basic way to see how good our model is for our data is to compute the **residual** for each data point, namely:
$$
\text{residual} = \text{data} - \text{predicted}
$$
or, for our planets with the linear prediction:
$$
\text{residual} = \text{orbital period} - \text{point on the line}
$$
A good model should have small residuals.

> Generally, the closer all the residuals are to 0, the better the model.

We make this calculation for each data point, and we get the following:

![[summer program 2023/puzzles-and-problems/---files/Pasted image 20230902145242.png]]

Look at some of these residuals. Some have magnitudes of 1000 (days)! That is kind of huge. Also, curiously, this residual plot shows a curved pattern, which looks like a systematic trend. Another *rule of thumb* is this:

> A "good model" ought to have a residual plot that looks roughly **random** about the $Y=0$ line, showing no systematic trend.

So perhaps there is a better model for our planets, and perhaps this is why we got a poor prediction for Neptune's orbital period.

## Some common relations and transformations: Power law and exponential (geometric) law.

In physical science and social science, there are two common relations that occur between the predictor variable $X$ and the response variable $Y$ (but these are not the only ones, of course):

1. Power law: $Y=aX^{b}$
2. Exponential/geometric law: $Y=ab^{X}$

How might we tell whether our data follow a power law or an exponential law? A simple way is to transform them by **logarithm**. Taking the logarithm of both sides of each equation, we get

1. Power law: $\log Y=\log a + b\log X$
2. Exponential/geometric law: $\log Y=\log a + X\log b$

In other words:

1. If we have a power law, then $(\log X,\log Y)$ should be linear, with slope $b$ and vertical intercept $\log a$.
2. If we have an exponential law, then $(X,\log Y)$ should be linear, with slope $\log b$ and vertical intercept $\log a$.

Put another way:

1. If we have a power law, the log-log plot ($\log Y$ against $\log X$) should appear linear.
2. If we have an exponential law, the log plot ($\log Y$ against $X$) should appear linear.

It does not matter which base of logarithm you use here, so long as you are consistent. Base 10, base $e$, and base 2 are common choices.

Let us make the log-log plot and log plot for our planets (using base 10).

Here is the log-log plot:

![[summer program 2023/puzzles-and-problems/---files/Pasted image 20230902151648.png]]

Here is the log plot:

![[summer program 2023/puzzles-and-problems/---files/Pasted image 20230902151749.png]]

Notice here that the log-log plot gives something that looks linear!
This suggests the orbital period and the mean distance to the sun might follow a power law, $Y=aX^{b}$.

Linear regression (by the least-squares method, something you will learn in one of your linear algebra classes, if you are taking one) is the easier kind of regression to perform, so we apply it to the transformed data. Performing a linear regression to find a line of best fit for our log-log plot gives a line with slope $1.50314$ and intercept $2.56142$. Let us plot this line onto the log-log plot:

![[summer program 2023/puzzles-and-problems/---files/Pasted image 20230902152513.png]]

This looks pretty good! But what does that mean for the power law model $Y=aX^{b}$ before we take the log-log transform? Since this line has slope $1.50314$, by our log-log transformation this means $b=1.50314$; and as this line has intercept $2.56142$, this means $\log a =2.56142$, or $a = 10^{2.56142}\approx 364.2671$ (we used base $10$ log here). So this gives a power law model for our planets:
$$
Y = 364.2671 \cdot X^{1.50314}
$$
where $Y$ is the orbital period in days and $X$ is the mean distance to the sun in AU.

Let us test this! First, Neptune: could we predict Neptune's orbital period better this time? Neptune has a mean distance to the sun of about $30$ AU, so with that plugged in, we get $Y=364.2671(30)^{1.50314}\approx 60498$ days. This is close to 60225 days!

Secondly, let us plot this model $Y = 364.2671 \cdot X^{1.50314}$ against our data:

![[summer program 2023/puzzles-and-problems/---files/Pasted image 20230902153828.png]]

We see a pretty decent fit! How about the residual plot? That is, take each data point and subtract the corresponding point on the curve:

![[summer program 2023/puzzles-and-problems/---files/Pasted image 20230902154053.png]]

Now the residuals have magnitudes of around 10 (days), much better than the 1000 (days) from our linear model! Perhaps we still see somewhat of a curve, but what we ought to do is collect a lot more data and see if any systematic pattern still persists.
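
The log-log procedure described above can be sketched in a few lines of Python: take base-10 logs of both variables, fit a line, then undo the transform to recover $a$ and $b$. As before, `fit_line` and `power_law` are names invented for this sketch, and the planet values are rounded modern figures (an assumption, not the data used in this note), so the fitted $a$ and $b$ will be close to, but not the same as, the values quoted above.

```python
import math

# ASSUMPTION: rounded modern values (mean distance in AU, orbital period in days),
# not the Tycho Brahe / Kepler values used in the note.
planets = [
    (0.387, 87.97),      # Mercury
    (0.723, 224.70),     # Venus
    (1.000, 365.26),     # Earth
    (1.524, 686.98),     # Mars
    (5.203, 4332.59),    # Jupiter
    (9.537, 10759.22),   # Saturn
]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Step 1: transform both variables with a base-10 logarithm.
log_x = [math.log10(x) for x, _ in planets]
log_y = [math.log10(y) for _, y in planets]

# Step 2: fit a line to the log-log data: log Y = log a + b log X.
b, log_a = fit_line(log_x, log_y)

# Step 3: undo the transform to get the power law Y = a * X**b.
a = 10 ** log_a


def power_law(x):
    """Predicted orbital period (days) at mean distance x (AU)."""
    return a * x ** b


# Residual = data - predicted, on the original (untransformed) scale.
residuals = [y - power_law(x) for x, y in planets]
neptune_days = power_law(30.0)

print(f"Y ~ {a:.4f} * X^{b:.5f}")
print("largest residual (days):", max(abs(r) for r in residuals))
print("predicted period at 30 AU (days):", round(neptune_days))
```

Note that the line is fitted in log space, so it minimizes squared errors of $\log Y$ rather than of $Y$ itself; that is a design choice of this method, and it is why the residuals are recomputed on the original scale afterwards.
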
> Generally, the more data points you have, the better.

## The coefficient of determination $R^{2}$: Another measure of how good a model is.

In statistics one calculates several quantities called **statistics** in order to quantify things (assigning numerical values to things). However, sometimes they are interpreted in a somewhat "well, roughly speaking" way. But they are nonetheless often useful.

One statistic that tells us, "well, roughly speaking", how good a model is is the **coefficient of determination**, called $R^{2}$. It is computed as follows.

Suppose we want to deduce a relation between the response variable $Y$ and the predictor variable $X$, and we have collected $N$ data points $(x_{i},y_{i})$, with average $y$ value $\bar y = \frac{1}{N}\sum_{i=1}^{N} y_{i}$. And suppose we have a model $Y=f(X)$ that we think explains the relationship, where $f$ is a function. Then we can compute the following quantity:
$$
R^{2} = 1 - \frac{\sum_{i=1}^{N}(y_{i}-f(x_{i}))^{2}}{\sum_{i=1}^{N}(y_{i}-\bar y)^{2}}
$$
called the **coefficient of determination**, or **R-squared**. Here the numerator is the sum of the squared residuals of the model, and the denominator is the sum of the squared residuals we would get by always predicting the average $\bar y$.

What this tells us, "well, roughly speaking", is how good our model is at explaining the dependency between $Y$ and $X$ as compared to the average. If our model $Y= f(X)$ is a perfect model, then the numerator of the fraction would be zero, and $R^{2}$ would be 1. If our model $Y=\bar y$ is just the constant average value, then $R^{2}=0$.

> So, an $R^{2}$ value closer to 1 means our model is pretty good at explaining the dependency beyond using the average value, while an $R^{2}$ closer to 0 means our model is about as good as simply using the average as a prediction. "Well, roughly speaking."

One can find more interpretations of $R^{2}$ in the literature and on the web. Just remember: beyond its mathematical definition, they are sometimes "well, roughly speaking" explanations.

In any case, this gives an idea of how good our model is, and lets us compare two models.
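
The $R^{2}$ formula translates almost word for word into code. The sketch below is only an illustration: the function name `r_squared` is made up here, the two models are the linear and power law models found earlier in this note, and the planet values are rounded modern figures (an assumption, not the data used in this note), so the numbers it prints will not exactly match any $R^{2}$ values computed from the original data.

```python
def r_squared(xs, ys, model):
    """Coefficient of determination of `model` on the data (xs, ys)."""
    y_bar = sum(ys) / len(ys)
    # Numerator: squared residuals of the model.
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    # Denominator: squared residuals of "always predict the average".
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# ASSUMPTION: rounded modern values, not the Tycho Brahe / Kepler data.
xs = [0.387, 0.723, 1.000, 1.524, 5.203, 9.537]            # mean distance (AU)
ys = [87.97, 224.70, 365.26, 686.98, 4332.59, 10759.22]    # orbital period (days)


def linear_model(x):
    """Linear model from this note: Y = 1164.61 X - 818.423."""
    return 1164.61 * x - 818.423


def power_model(x):
    """Power law model from this note: Y = 364.2671 * X^1.50314."""
    return 364.2671 * x ** 1.50314


y_mean = sum(ys) / len(ys)

r2_linear = r_squared(xs, ys, linear_model)
r2_power = r_squared(xs, ys, power_model)
r2_mean = r_squared(xs, ys, lambda x: y_mean)  # the "just use the average" model

print("linear model R^2:", r2_linear)
print("power law R^2:  ", r2_power)
print("mean-only R^2:  ", r2_mean)
```

The mean-only model is included as a sanity check: by construction its numerator equals its denominator, so its $R^{2}$ comes out as 0.
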
For our planets example, if we use the linear model from the first case, we get $R^{2}=0.984572364482$. But with our power law model, we get $R^{2} = 0.999999034346$. Both of these were calculated using the $R^{2}$ formula. This suggests the power law model is better than the linear model!

Okay. Hopefully this is enough to get you started in understanding data. The story of course gets deeper as it goes...